In this exercise, we will download a pair of files in fastq.gz format and explore their contents. We will generate graphs to synthetically represent the sequencing quality as well as the composition of nitrogenous bases.
# Windows:
# Click on the start button and type `cmd` in the search bar. Press Enter.
# macOS:
# Press Command (⌘) + space to open Spotlight Search. Type "Terminal" and press Enter.
# Windows and macOS: in your command prompt type:
python3 -m ensurepip --upgrade
python3 -m pip install --upgrade pip
python3 -m pip install Bio Numpy Matplotlib
# Linux: in your command prompt type:
sudo apt update
sudo apt install python3-pip
pip install Bio Numpy Matplotlib
The fastq files we will download are available on the ENA website. It is the European Nucleotide Archive, which is a comprehensive nucleotide sequence database, that includes raw sequencing data, sequence assembly information and functional annotation.
Go to the ENA database, and type the Saccharomyces cerevisiae in the search bar. You will get different blocks of results. In the Read block, click on the Run field and choose the first run accession that appears: SRR800768.
What sequencing platform was used to do the sequencing run?
What is the instrument model used?
What is the library type used for the sequencing? Paired-end? Single-end or mate-pair?
What is the type of molecules that have been sequenced?
A the bottom of the page, you can see two fastq file names.
What is the extension of these files? What does it indicate?
Are these files expected to be different in their contents? Justify your answer.
Select both of the files and click on Get Download script.
A file will be downloaded. You can open it with a text editor (like notepad++ or Visual Studio Code). In this file you can find the following command lines that will allow you to download both files:
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR800/SRR800768/SRR800768_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR800/SRR800768/SRR800768_2.fastq.gz
Open a command prompt and paste the command lines in your command prompt.
If 'wget' is not recognized, replace 'wget -nc' by 'curl -O' and paste the command lines in your command prompt.
To simplify the subsequent analyses, a subsampling from each of SRR800768_1.fastq.gz and SRR800768_2.fastq.gz was done using the seqtk library. Subsampling was done in a way to keep reads pairing (same number of reads in subsampled files, in the same order).
The subsampled files: SRR800768_1_sub.fastq and SRR800768_2_sub.fastq can be downloaded on icampus. We will use these files for the rest of the exercice.
Open a Python terminal. Do not forget to type all your commands in a text file with a .py extension.
Here is the libraries that need to be imported at the beginning of your python script:
from Bio import SeqIO
import gzip
import matplotlib.pyplot as plt
import numpy as np
Make sure that the Python session's working directory aligns with the location of this fatsq files. For that you can use the Python library os as follows:
import os
os.getcwd()
In case your fastq files are not located in the Python current working directory, you have two options: either copy them into the Python current directory, or modify the Python current directory to that of the fastq files using the following code:
new_directory = "/path/to/your/new/directory" # better if absolute path
os.chdir(new_directory)
Then, you will have to read both files using the following syntax.
# For gzipped files
records = SeqIO.parse(gzip.open('filename.fastq.gz', 'rt', encoding='utf-8'), 'fastq')
# For non-gzipped files
records = SeqIO.parse('filename.fastq', 'fastq')
# You can iterate over my file as follow:
for record in records:
seq = record.seq
phred_scores = record.letter_annotations['phred_quality']
0 to read length - 1) and on the y axis the percentages of each base (A in red, T in green, C in blue and G in black).
Estimate the GC content of the Saccharomyces cerevisiae sequenced genome and compare it with the known GC content of the species. 6. For each file, generate a plot illustrating the median, first quartile (Q1) and third quartile (Q3) of the Phred scores for each position in the read. The x-axis represents the position on the read, and the y-axis depicts the median (in purple), Q1 (in blue), and Q3 (in red) of the Phred scores.
Tip: you can use numpy to compute the Q1, Q2 and Q3 of a list:
my_list = [1, 2, 3, 3, 3, 5, 8]
# Q2 = median
my_median = np.median(my_list)
# Q1
Q1 = np.percentile(my_list, 25)
# Q3
Q3 = np.percentile(my_list, 75)
SRR800768_1_sub_clean.fastq and SRR800768_2_sub_clean.fastq.SRR800768_1_sub_sing_clean.fastq and SRR800768_2_sub_sing_clean.fastq.Tip: Here is a code snippet to create and fill a fastq file:
with open(path/to/my_new_file, 'w') as f:
for record in records :
SeqIO.write(record, f, 'fastq')
C.6.